Abstract

Valiant's (2007) model of evolvability models the evolutionary process of acquiring useful functional- 
ity as a restricted form of learning from random examples. Linear threshold functions and their various 
subclasses, such as conjunctions and decision lists, play a fundamental role in learning theory and hence 
their evolvability has been the primary focus of research on Valiant's framework (2007). One of the main 
open problems regarding the model is whether conjunctions are evolvable distribution-independently (Feld- 
man and Valiant, 2008). We show that the answer is negative. Our proof is based on a new combinatorial 
parameter of a concept class that lower-bounds the complexity of learning from correlations. 

We contrast the lower bound with a proof that linear threshold functions having a non-negligible margin 
on the data points are evolvable distribution-independently via a simple mutation algorithm. Our algorithm 
relies on a non-linear loss function being used to select the hypotheses instead of 0-1 loss in Valiant's (2007) 
original definition. The proof of evolvability requires that the loss function satisfies several mild conditions 
that are, for example, satisfied by the quadratic loss function studied in several other works (Michael, 
2007; Feldman, 2009; Valiant, 2010). An important property of our evolution algorithm is monotonicity, 
that is the algorithm guarantees evolvability without any decreases in performance. Previously, monotone 
evolvability was only shown for conjunctions with quadratic loss (Feldman, 2009) or when the distribution 
on the domain is severely restricted (Michael, 2007; Feldman, 2009; Kanade et al., 2010).

1 Introduction 

Evolution is the source of the spectacularly complex organisms and behavior that we see around us. Yet 
we know very little about the computational mechanisms that can lead to such complexity while respecting 
the constraints of the Darwinian evolutionary process and using a plausible amount of resources. Recently 
Valiant suggested that an appropriate framework for understanding the power of evolution to produce complex
behavior is that of computational learning theory [27] since both evolution and learning involve processes that
adapt their behavior on the basis of experience. Accordingly, in his model, evolvability of a certain useful 
functionality is cast as a problem of learning the desired functionality through a process in which, at each step, 
the most "fit" candidate function is chosen from a small pool of mutations of the current candidate. Limits on 
the number of steps and the amount of computation performed at each step are imposed to make this process 
naturally plausible. A class of functions C is considered evolvable if there exists a single representation 
scheme R and a mutation algorithm M on R that, when guided by such selection, guarantees convergence to
the desired function for every function in C. Here the requirements closely follow those of the celebrated 
PAC learning model [26]. In fact, every evolution algorithm (here and below in the sense defined in Valiant's
model) can be simulated by an algorithm that is given random examples of the desired function. In addition, 
many properties of learning algorithms such as distribution-independence, weakness and attribute-efficiency 
apply equally to evolvability. 



1.1 Prior Work 



Table 1: Positive results on evolvability. For the distribution entry, "All" refers to distribution-independent
evolvability and $\mathcal{D}$ refers to any fixed set of distributions (including "All"). All results for Boolean loss also apply
to all other loss functions.

The constrained way in which evolution algorithms have to converge to the target function makes finding
such algorithms a substantially more involved task than designing PAC learning algorithms. Initially, only
the evolvability of monotone conjunctions of Boolean variables, and only when the distribution over the
domain is uniform, was demonstrated (if not specified otherwise, the domain is $\{0,1\}^n$) [27]. Subsequently
this result was simplified [3] and strengthened to general conjunctions [13, 14]. Later Michael [22] described
an algorithm for evolving decision lists over the uniform distribution that used a larger space of hypotheses 
and a different performance metric over hypotheses (specifically, quadratic loss). In our earlier work we 
showed that evolvability is, at least within polynomial limits, equivalent to learning by a natural restriction 
of well-studied statistical queries (SQ) [15], referred to as correlational statistical queries (CSQ). This
result gives distribution-specific algorithms for any SQ learnable class of functions. By characterizing weak
distribution-independent evolvability and using communication-complexity-based lower bounds [25, 2], we
also proved that general linear threshold functions (also referred to as halfspaces) and even decision lists are 
not evolvable distribution-independently. 

In another work [7] we examined the relative power of a number of variants of the model discussed in
Valiant's and other works flf\. Among them we considered a generalization of the model to real-valued 
hypotheses for which one needs to specify the loss function used to measure the loss in performance at every 
point. We demonstrated that a number of variants of the model are all equivalent to learning by CSQs and 
hence to the original model [5]. The only two properties which we found to influence the power of the model
are the choice of the loss function (with the original 0/1 loss being equivalent to evolving with the linear
loss) and monotonicity, or the requirement that the performance of hypotheses does not decrease in the course of
evolution. Valiant's original selection rule allows small decreases in performance^1. This somewhat unnatural
property has been exploited in all the results showing equivalence to learning by CSQs^2 and hence evolution
algorithms obtained through such general transformations are non-monotone. In a recent work Kanade et al.
[14] show that the equivalence to learning by CSQs still holds if the total allowed decrease in performance is
bounded by any non-negligible value chosen in advance (they refer to such algorithms as quasi-monotone). 
The first general transformation that yields monotone algorithms was given in our subsequent work [6] where 
we showed that every concept class SQ learnable over a fixed distribution D is evolvable monotonically 
over D when using quadratic loss. By exploiting some of the techniques of the general transformation, we 
also showed that conjunctions are evolvable distribution-independently when using quadratic loss [6]. We
summarize these results and several other known evolution algorithms in Table 1.

^1 In this context we refer to empirical performance rather than true expected performance.

^2 The decreases in performance can be avoided if the evolution algorithm starts in a certain fixed state, i.e., is initialized.

1.2 Our Results

As can be seen from Table 1, evolvability of even the most basic concept classes is still only partially
understood. Most notably, prior to this work it was unknown whether conjunctions are evolvable
distribution-independently with Boolean loss (even without requiring monotonicity) and this question was posed by
Valiant and the author as an open problem at COLT 2008 [8]. In our first result (Section 3) we show that
the answer is negative. Specifically, we prove that for any $k = \omega(1)$, monotone conjunctions of at most $k$
variables are not evolvable distribution-independently to any accuracy $\epsilon = o(1)$. Our technique is based on
a new combinatorial parameter of a concept class that, roughly, measures the maximum number of
correlational query functions required to distinguish every target function-distribution pair from a fixed
function-distribution pair. This general approach is based on our recent characterization of strong SQ learnability [6].
For a given size of conjunction $k$, we then come up with a construction of a set of conjunction-distribution
pairs $\{(t_S, D_S) \mid |S| = k\}$ that cannot be distinguished from a constant function over the uniform distribution
using a polynomial number of queries. The distribution is designed in such a way that it hides all Fourier
coefficients of the conjunction up to degree $k/3$. Simple facts from Fourier analysis of Boolean functions
then imply that distinguishing between a superpolynomial number of such conjunction-distribution pairs is 
impossible using a polynomial number of queries. 

We interpret this negative result as highlighting significant limitations of evolvability based on the Boolean 
feedback only. We note that many functions in biological evolution are not Boolean. For example, for most 
genes the amount of gene expression (that is the amount of protein produced) can vary in a certain range 
continuously (up to, of course, the granularity of a single molecule). Therefore it is natural to assume that 
when evolving the optimal regulation of gene expression (described by a Boolean function), intermediate 
amounts of the protein will be produced. The intermediate values are likely to cause intermediate values of 
loss relative to the optimal 0 or 1 value. It is therefore important to understand evolvability with other loss
functions. Toward this goal, in Section 4 we show that linear threshold functions are evolvable monotonically
and distribution-independently for quadratic loss function and all other loss functions satisfying a set of mild 
conditions. We refer to loss functions that satisfy the required conditions as well-behaved. The amount of 
resources required by our algorithm depends quadratically on $1/\gamma$ where $\gamma$ is the margin of the target
halfspace on the data points. Therefore, like the famous Perceptron and Winnow algorithms [23, 21], it is efficient
only when the margin is non-negligible or lower bounded by the inverse of a polynomial in n. In the Support 
Vector Machine (SVM) literature this condition is usually referred to as having a large margin. Further, the 
representation used by our evolution algorithm is similar to linear thresholds and the mutation algorithm is 
fairly simple and natural. The only operations it requires are adding the function $\alpha \cdot x_i$ to the current
function for a real $\alpha$ and bounding the value of the function to be in [-1, 1].

A very popular and powerful approach to learning when data points are not linearly separable is to embed 
the data points in a different (often higher dimensional) Euclidean space where the examples become linearly 
separable and then use a halfspace learning algorithm such as Perceptron or SVM to produce a classifier.
Such an approach also works in the context of evolvability and implies monotone evolvability of any concept
class that can be efficiently embedded into large-margin halfspaces over some Euclidean space (efficiency of 
the embedding also bounds the dimension of the space). Therefore our second result approaches some of 
the most important and strongest results for PAC learning while also being a natural algorithm in Valiant's 
framework of evolvability. 

We note that a similar mutation algorithm was used in our result for conjunctions. However, our
analysis here is new and differs conceptually from the analysis for conjunctions which cannot be extended 
to halfspaces. It also gives substantially stronger bounds. For example, it improves the dependence of the 
improvement in each step on e from to e^. The key to this result for the quadratic loss function is a simple 
proof that for every distribution $D$, halfspace $f$ and any real-valued function $\phi$ with the range in [-1, 1], there
exists a variable $x_i$ that is correlated with the gradient of the loss function at point $\phi$. The absolute value of
the correlation is lower-bounded by the inverse of a polynomial in $n$, $1/\epsilon$ and $1/\gamma$ and therefore is sufficient
to imply that a small step in the direction of $x_i$ (or $-x_i$) will reduce the loss.

A recent work by P. Valiant [28] examines the extension of the model of evolvability to real-valued
target functions. His results paint a picture quite similar to what we know about the evolvability of Boolean 
functions. In particular, his simple algorithm for evolving linear functions when using the quadratic loss can 
be seen as the counterpart of our algorithm for halfspaces. 

2 Preliminaries 

For a positive integer $\ell$, let $[\ell]$ denote the set $\{1, 2, \ldots, \ell\}$ and for $i < \ell$ let $[i..\ell]$ denote the set $\{i, i+1, \ldots, \ell\}$.
We denote the domain of our learning problems by $X$. As usual it is parameterized by an (implicit) dimension
$n$. A concept class over $X$ is a set of $\{-1, 1\}$-valued functions over $X$ referred to as concepts. Let $\mathcal{F}_1^\infty$ denote
the set of all functions from $X$ to [-1, 1] (that is, all the functions with $L_\infty$ norm bounded by 1). It will be
convenient to view a distribution $D$ over $X$ as defining the product $\langle \phi, \psi \rangle_D = \mathbf{E}_{x \sim D}[\phi(x) \cdot \psi(x)]$ over the space
of real-valued functions on $X$. It is easy to see that this is simply a non-negatively weighted version of the
standard dot product over $\mathbb{R}^X$ and hence is a positive semi-inner product over $\mathbb{R}^X$. The corresponding norm is
defined as $\|\phi\|_D = \sqrt{\mathbf{E}_D[\phi^2(x)]} = \sqrt{\langle \phi, \phi \rangle_D}$.

Let $B_n = \{x \mid \|x\|_2 \le 1\}$ be the ball of radius 1 in $\mathbb{R}^n$, $X$ be a subset of $B_n$, and $f = \mathrm{sign}(\sum_{i \in [n]} w_i x_i - \theta)$ be
a linear threshold function (halfspace). We define the margin $\gamma$ of $f$ on $X$ as $\gamma = \inf_{x \in X} \{ |\sum_{i \in [n]} w_i x_i - \theta| / \|w\|_2 \}$. For
convenience we use $x_0$ to refer to the constant function 1.
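The definitions above can be checked numerically on a small finite domain. The following is a minimal sketch, not part of the original development: the helper names (`inner`, `norm`, `margin`), the four-point domain, and the chosen halfspace are all illustrative assumptions, and the margin is computed with normalization by the Euclidean norm of the weight vector, which is how we read the definition.

```python
import math

def inner(phi, psi, dist):
    """<phi, psi>_D = E_{x~D}[phi(x) * psi(x)] for a finite distribution {point: prob}."""
    return sum(p * phi(x) * psi(x) for x, p in dist.items())

def norm(phi, dist):
    """||phi||_D = sqrt(<phi, phi>_D)."""
    return math.sqrt(inner(phi, phi, dist))

def margin(w, theta, points):
    """Smallest normalized distance |w.x - theta| / ||w||_2 over the data points."""
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) - theta) / w_norm
               for x in points)

# Hypothetical toy domain: four points inside the unit ball of R^2, uniform weights.
points = [(0.6, 0.0), (0.0, 0.6), (-0.6, 0.0), (0.0, -0.6)]
dist = {x: 0.25 for x in points}

# Hypothetical halfspace f = sign(x_1 + x_2) and a real-valued function phi(x) = x_1.
w, theta = (1.0, 1.0), 0.0
f = lambda x: 1.0 if w[0] * x[0] + w[1] * x[1] - theta >= 0 else -1.0
phi = lambda x: x[0]

print(inner(f, phi, dist), norm(f, dist), margin(w, theta, points))
```

On this toy input the correlation of $f$ with $x_1$ is 0.3 and the margin is $0.6/\sqrt{2}$, so the halfspace has a non-negligible margin in the sense used later in the paper.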

2.1 PAC Learning 

The models we consider are based on the well-known PAC learning model introduced by Valiant [26]. Let
$C$ be a concept class over $X$. In the basic PAC model a learning algorithm is given examples of an unknown
function $f$ from $C$ on points randomly chosen from some unknown distribution $D$ over $X$ and should produce
a hypothesis $h$ that approximates $f$. Formally, an example oracle EX($f$, $D$) is an oracle that upon being
invoked returns an example $\langle x, f(x) \rangle$, where $x$ is chosen randomly with respect to $D$, independently of any
previous examples.

An algorithm is said to PAC learn $C$ in time $t$ if for every $\epsilon > 0$, $f \in C$, and distribution $D$ over $X$, the
algorithm given $\epsilon$ and access to EX($f$, $D$) outputs, in time $t$ and with probability at least 2/3, a hypothesis $h$
that is evaluatable in time $t$ and satisfies $\Pr_D[f(x) \neq h(x)] \le \epsilon$. We say that an algorithm efficiently learns $C$
when $t$ is upper bounded by a polynomial in $n$, $1/\epsilon$.

The basic PAC model is also referred to as distribution-independent learning to distinguish it from 
distribution-specific PAC learning in which the learning algorithm is required to learn only with respect 
to a single distribution $D$ known in advance. More generally, following Kearns et al., one can
analogously define the learnability of a set of distribution-function pairs over the same domain $X$. Namely, a set of
distribution-function pairs $\mathcal{Z}$ is PAC learnable if there exists a learning algorithm that learns $f$ over $D$ (as in
the definition above) for every $(D, f) \in \mathcal{Z}$.

A weak learning algorithm [16] is a learning algorithm that produces a hypothesis whose disagreement
with the target concept is noticeably less than 1/2 (and not necessarily any $\epsilon > 0$). More precisely, a weak
learning algorithm produces a Boolean hypothesis $h$ such that $\Pr_D[f(x) \neq h(x)] \le 1/2 - 1/p(n)$ for some
fixed polynomial $p$.

2.2 The Statistical Query Learning Model 

In the statistical query model of Kearns [15] the learning algorithm is given access to STAT($f$, $D$), a
statistical query oracle for target concept $f$ with respect to distribution $D$, instead of EX($f$, $D$). A query to this oracle
is a function $\psi : X \times \{-1, 1\} \to \{-1, 1\}$. The oracle may respond to the query with any value $v$ satisfying
$|\mathbf{E}_D[\psi(x, f(x))] - v| \le \tau$ where $\tau \in [0, 1]$ is a real number called the tolerance of the query. An algorithm $\mathcal{A}$ is
said to learn $C$ in time $t$ from statistical queries of tolerance $\tau$ if $\mathcal{A}$ PAC learns $C$ using STAT($f$, $D$) in place
of the example oracle. In addition, each query $\psi$ made by $\mathcal{A}$ has tolerance $\tau$ and can be evaluated in time $t$.

The algorithm is said to (efficiently) SQ learn $C$ if $t$ is polynomial in $n$ and $1/\epsilon$, and $\tau$ is lower-bounded
by the inverse of a polynomial in $n$ and $1/\epsilon$.

A correlational statistical query is a statistical query for a correlation of a function over $X$ with the target.
Namely, the query function is $\psi(x, \ell) = \phi(x) \cdot \ell$ for a function $\phi \in \mathcal{F}_1^\infty$. A concept class is said to be CSQ
learnable if it is learnable by an SQ algorithm that uses only CSQ queries.

2.3 Evolvability 

We start by presenting a brief overview of the model. For a detailed description and intuition behind the 
various choices made in the model the reader is referred to [27, 7]. The goal of the model is to specify how
organisms can acquire complex mechanisms via a resource-efficient process based on random mutations
and guided by performance-based selection. The mechanisms are described in terms of the multi-argument
functions they implement. The performance of such a mechanism is measured by evaluating the agreement 
of the mechanism with some "ideal" behavior function. The value of the "ideal" function on some input 
describes the most beneficial behavior for the condition represented by the input. The evaluation of the 
agreement with the "ideal" function is derived by evaluating the function on a moderate number of inputs 
drawn from a probability distribution over the conditions that arise. These evaluations correspond to the 
experiences of one or more organisms that embody the mechanism.

Random variation is modeled by the existence of an explicit algorithm that acts on some fixed
representation of mechanisms and for each representation of a mechanism produces representations of mutated versions
of the mechanism. The model requires that the mutation algorithm be efficiently implementable. Selection 
is modeled by an explicit rule that determines the probabilities with which each of the mutations of a mech- 
anism will be chosen to "survive" based on the performance of all the mutations of the mechanism and the 
probabilities with which each of the mutations is produced by the mutation algorithm. 

As can be seen from the above description, a performance landscape (given by a specific "ideal" function 
and a distribution over the domain), a mutation algorithm, and a selection rule jointly determine how each 
step of an evolutionary process is performed. A class of functions C is considered evolvable if there exist 
a representation of mechanisms $R$ and a mutation algorithm $M$ such that for every "ideal" function $f \in C$,
a sequence of evolutionary steps starting from any representation in $R$ and performed according to the
description above "converges" in a polynomial number of steps to $f$. This process is essentially PAC learning
of $C$ with the selection rule (rather than explicit examples) providing the only target-specific feedback. We
now define the model formally using the notation from [7].

2.4 Definition of Evolvability 

The description of an evolution algorithm $\mathcal{A}$ consists of the definition of the representation class $R$ of possibly
randomized hypotheses in $\mathcal{F}_1^\infty$ and the description of a polynomial-time mutation algorithm $M$ that for every $r \in R$
and $\epsilon > 0$ outputs a random mutation of $r$.

Definition 2.1 An evolution algorithm $\mathcal{A}$ is defined by a pair $(R, M)$ where

• $R$ is a representation class of functions over $X$ with range in [-1, 1].

• $M$ is a randomized polynomial-time algorithm that, given $r \in R$ and $\epsilon$ as input, outputs a representation
$r_1 \in R$ with probability $\Pr_{\mathcal{A}}(r, r_1)$. The set of representations that can be output by $M(r, \epsilon)$ is referred
to as the neighborhood of $r$ for $\epsilon$ and denoted by $\mathrm{Neigh}_{\mathcal{A}}(r, \epsilon)$.

A loss function $L$ on a set of values $Y$ is a non-negative mapping $L : Y \times Y \to \mathbb{R}^+$. $L(y, y')$ measures
the "distance" between the desired value $y$ and the predicted value $y'$. In the context of learning Boolean
functions using hypotheses with values in [-1, 1] we only consider functions $L : \{-1, 1\} \times [-1, 1] \to \mathbb{R}^+$.
Valiant's original model only considers Boolean hypotheses and hence only the disagreement loss (or 0-1
loss) $L_\Delta(y, y')$, which is 0 when $y = y'$ and 1 otherwise. It was shown in our earlier work [5] that such loss is equivalent to
the linear loss $L_1(y, y') = |y' - y|$ over hypotheses with the range in [-1, 1]. The other loss function we use
here is the quadratic loss $L_Q(y, y') = (y' - y)^2$. For a function $\phi \in \mathcal{F}_1^\infty$ its performance relative to
loss function $L$, distribution $D$ over the domain and target function $f$ is defined as

$$\mathrm{LPerf}_f(\phi, D) = 1 - 2 \cdot \mathbf{E}_D[L(f(x), \phi(x))] / L(-1, 1).$$

For an integer $s$, functions $\phi, f \in \mathcal{F}_1^\infty$ over $X$, distribution $D$ over $X$ and loss function $L$, the empirical fitness
$\mathrm{LPerf}_f(\phi, D, s)$ of $\phi$ is a random variable that equals $1 - \frac{2}{s \cdot L(-1, 1)} \sum_{i \in [s]} L(f(z_i), \phi(z_i))$ for $z_1, z_2, \ldots, z_s \in X$
chosen randomly and independently according to $D$.
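To make the performance measure concrete, here is a small sketch of $\mathrm{LPerf}$ and its empirical counterpart over a finite domain. The function names, the two-bit domain, and the target and hypothesis are hypothetical choices made for illustration; the empirical version uses the normalization $2/(s \cdot L(-1,1))$, which is our reading of the definition above.

```python
import random
from itertools import product

def linear_loss(y, y_pred):
    return abs(y_pred - y)

def quadratic_loss(y, y_pred):
    return (y_pred - y) ** 2

def lperf(f, phi, dist, loss):
    """Exact performance 1 - 2 * E_D[L(f(x), phi(x))] / L(-1, 1)."""
    expected = sum(p * loss(f(x), phi(x)) for x, p in dist.items())
    return 1.0 - 2.0 * expected / loss(-1.0, 1.0)

def empirical_lperf(f, phi, dist, loss, s, rng):
    """Empirical performance on s independent samples drawn from dist."""
    xs = rng.choices(list(dist), weights=list(dist.values()), k=s)
    total = sum(loss(f(x), phi(x)) for x in xs)
    return 1.0 - 2.0 * total / (s * loss(-1.0, 1.0))

# Hypothetical example: uniform distribution on {0,1}^2, target = AND of both bits,
# hypothesis phi = f / 2 (correct sign everywhere, but only "half confident").
points = list(product((0, 1), repeat=2))
dist = {x: 1.0 / len(points) for x in points}
f = lambda x: 1.0 if x[0] == 1 and x[1] == 1 else -1.0
phi = lambda x: 0.5 * f(x)

print(lperf(f, phi, dist, linear_loss))      # equals E_D[f * phi] for linear loss
print(lperf(f, phi, dist, quadratic_loss))
print(empirical_lperf(f, phi, dist, quadratic_loss, 64, random.Random(0)))
```

With the linear loss the performance coincides with the correlation $\mathbf{E}_D[f \cdot \phi]$, which is why it matches the original Boolean measure; the quadratic loss penalizes the same hypothesis less for being close to the target in value.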

A number of natural ways of modeling selection were discussed in prior work [27, 7]. For concreteness
here we describe the selection rule used in Valiant's main definition in a slightly generalized version from
[7]. In selection rule SelNB[$L, t, p, s$], $p$ candidate mutations are sampled using the mutation algorithm. Then
beneficial and neutral mutations are defined on the basis of their empirical fitness LPerf in $s$ experiments (or
examples) using tolerance $t$. If some beneficial mutations are available, one is chosen randomly according to
their relative frequencies in the candidate pool. If none is available then one of the neutral mutations is output
randomly according to their relative frequencies. If neither neutral nor beneficial mutations are available, $\perp$ is
output to mean that no mutation "survived".

Definition 2.2 For a loss function $L$, tolerance $t$, candidate pool size $p$, and sample size $s$, selection rule SelNB[$L, t, p, s$]
is an algorithm that for any function $f$, distribution $D$, mutation algorithm $\mathcal{A} = (R, M)$, a representation
$r \in R$, and accuracy $\epsilon$, SelNB[$L, t, p, s$]($f, D, \mathcal{A}, r$) outputs a random variable that takes a value $r_1$ determined as
follows. First run $M(r, \epsilon)$ $p$ times and let $Z$ be the set of representations obtained. For $r' \in Z$, let $\Pr_Z(r')$ be the
relative frequency with which $r'$ was generated among the $p$ observed representations. For each $r' \in Z \cup \{r\}$,
compute an empirical value of fitness $v(r') = \mathrm{LPerf}_f(r', D, s)$. Let $\mathrm{Bene}(Z) = \{r' \mid v(r') \ge v(r) + t\}$ and
$\mathrm{Neut}(Z) = \{r' \mid |v(r') - v(r)| < t\}$. Then

(i) if $\mathrm{Bene}(Z) \neq \emptyset$ then output $r_1 \in \mathrm{Bene}(Z)$ with probability $\Pr_Z(r_1) / \sum_{r' \in \mathrm{Bene}(Z)} \Pr_Z(r')$;

(ii) if $\mathrm{Bene}(Z) = \emptyset$ and $\mathrm{Neut}(Z) \neq \emptyset$ then output $r_1 \in \mathrm{Neut}(Z)$ with probability $\Pr_Z(r_1) / \sum_{r' \in \mathrm{Neut}(Z)} \Pr_Z(r')$;

(iii) if $\mathrm{Neut}(Z) \cup \mathrm{Bene}(Z) = \emptyset$ then output $\perp$.
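The selection rule can be sketched in a few lines of code. This is an illustrative sketch under simplifying assumptions rather than the paper's formal object: `mutate` and `fitness` are hypothetical callbacks (the latter standing in for the empirical fitness computed from $s$ samples), representations are assumed hashable, and `BOTTOM` plays the role of $\perp$.

```python
import random
from collections import Counter

BOTTOM = None  # stands in for the "no mutation survived" symbol

def sel_nb(mutate, fitness, r, eps, t, p, rng):
    """One step of SelNB.

    mutate(r, eps, rng) returns one random mutation of r (the algorithm M);
    fitness(r) returns the empirical performance v(r).
    """
    # Sample p candidates; Counter keeps how often each one was generated,
    # which is proportional to its relative frequency Pr_Z.
    pool = Counter(mutate(r, eps, rng) for _ in range(p))
    v_r = fitness(r)
    scores = {c: fitness(c) for c in pool}
    bene = [c for c in pool if scores[c] >= v_r + t]
    neut = [c for c in pool if abs(scores[c] - v_r) < t]
    chosen = bene or neut  # beneficial mutations take priority over neutral ones
    if not chosen:
        return BOTTOM
    return rng.choices(chosen, weights=[pool[c] for c in chosen], k=1)[0]

# Hypothetical toy landscape: representations are integers, the optimum is 5.
fitness = lambda r: -abs(r - 5) / 10.0
rng = random.Random(0)
step_up = lambda r, eps, rng: r + 1
step_down = lambda r, eps, rng: r - 1
stay = lambda r, eps, rng: r

print(sel_nb(step_up, fitness, 0, 0.1, 0.05, 10, rng))    # beneficial mutation is selected
print(sel_nb(stay, fitness, 0, 0.1, 0.05, 10, rng))       # only a neutral mutation exists
print(sel_nb(step_down, fitness, 0, 0.1, 0.05, 10, rng))  # nothing survives
```

Keeping the counts in a `Counter` makes the final draw proportional to relative frequency within the chosen set, as items (i) and (ii) require.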

A concept class $C$ is said to be evolvable by an evolution algorithm $\mathcal{A}$ guided by a selection rule Sel
over distribution $D$ if for every target concept $f \in C$, mutation steps as defined by $\mathcal{A}$ and guided by Sel will
converge to $f$. For simplicity here we only consider the selection rule SelNB.

Definition 2.3 For concept class $C$ over $X$, distribution $D$, mutation algorithm $\mathcal{A}$, and loss function $L$, we say that
the class $C$ is evolvable over $D$ by $\mathcal{A}$ with $L$ if there exist polynomials $1/t(n, 1/\epsilon)$, $s(n, 1/\epsilon)$, $p(n, 1/\epsilon)$ and
$g(n, 1/\epsilon)$ such that for every $n$, $f \in C$, $\epsilon > 0$, and every $r_0 \in R$, with probability at least $1 - \epsilon$, a sequence
$r_0, r_1, r_2, \ldots$, where $r_i \leftarrow \mathrm{SelNB}[L, t, p, s](f, D, \mathcal{A}, r_{i-1})$, will have $\mathrm{LPerf}_f(r_{g(n, 1/\epsilon)}, D) > 1 - \epsilon$.

As in PAC learning, we say that a concept class $C$ is evolvable if it is evolvable over all distributions by a
single evolution algorithm (we emphasize this by saying distribution-independently evolvable). Similarly, we
say that a class of distribution-function pairs $\mathcal{Z}$ is evolvable if the evolution algorithm is successful for all
pairs $(D, f) \in \mathcal{Z}$.

We say that an evolution algorithm $\mathcal{A}$ evolves $C$ over $D$ monotonically if with probability at least $1 - \epsilon$, for
every $i \le g(n, 1/\epsilon)$, $\mathrm{LPerf}_f(r_i, D) \ge \mathrm{LPerf}_f(r_0, D)$, where $g(n, 1/\epsilon)$ and $r_0, r_1, r_2, \ldots$ are defined as above.
Note that since the evolution algorithm can be started in any representation, this is equivalent to requiring
that with probability at least $1 - \epsilon$, $\mathrm{LPerf}_f(r_{i+1}, D) \ge \mathrm{LPerf}_f(r_i, D)$ for every $i$.



3 Lower Bounds on Distribution-Independent CSQ Learnability 

In this section we demonstrate that conjunctions are not evolvable with Boolean loss (or the equivalent linear 
loss). We obtain this result by exploiting the equivalence of evolvability with Boolean loss and efficient CSQ 
learnability. Our technique is based on a combinatorial parameter of a concept class $C$, referred to as CSQD, that
lower bounds the complexity of distribution-independent CSQ learning of $C$. This parameter can be seen as
a generalization of the approximation-based strong statistical query dimension given in our earlier work [6]
to the distribution-independent setting.

Definition 3.1 For a concept class $C$, and $\epsilon, \tau > 0$ we define $\mathrm{CSQD}(C, \epsilon, \tau)$ as the smallest number $d$ for
which it holds that for every distribution $D$ and function $\psi \in \mathcal{F}_1^\infty$, there exists a set of $d$ functions $G_\psi \subseteq \mathcal{F}_1^\infty$
and a Boolean function $h_\psi$ such that for every $f \in C$ and distribution $D'$, at least one of the following
conditions holds:

1. there exists $g \in G_\psi$ such that $|\langle f, g \rangle_{D'} - \langle \psi, g \rangle_D| > \tau$ or

2. $\Pr_{D'}[f(x) \neq h_\psi(x)] \le \epsilon$.

We now give a simple proof that $\mathrm{CSQD}(C, \epsilon, \tau)$ lower bounds the number of correlational statistical queries
of tolerance $\tau$ required to learn $C$ distribution-independently to accuracy $\epsilon$. Our proof is based on the proof
of the analogous result for the strong SQ dimension [6].

Theorem 3.2 If $C$ is learnable by a deterministic CSQ algorithm that uses $q(n, 1/\epsilon)$ queries of tolerance
$\tau(n, 1/\epsilon)$ then $\mathrm{CSQD}(C, \epsilon, \tau(n, 1/\epsilon)) \le q(n, 1/\epsilon)$.

Proof: Let $\mathcal{A}$ be the assumed CSQ learning algorithm for $C$. Let $\psi \in \mathcal{F}_1^\infty$ be any function and $D$ be any
distribution. The set $G_\psi$ and function $h_\psi$ are constructed as follows. Simulate algorithm $\mathcal{A}$ and for every
correlational query $(\phi_i \cdot \ell, \tau)$ add $\phi_i$ to $G_\psi$ and respond with $\langle \psi, \phi_i \rangle_D = \mathbf{E}_D[\phi_i(x) \cdot \psi(x)]$ to the query. Continue
the simulation until $\mathcal{A}$ outputs a hypothesis. Let $h_\psi$ be the hypothesis output by $\mathcal{A}$.

First, by the definition of $G_\psi$, $|G_\psi| \le q(n, 1/\epsilon)$. Now, let $f$ be any function in $C$ and $D'$ be a distribution.
If there does not exist $g \in G_\psi$ such that $|\langle f, g \rangle_{D'} - \langle \psi, g \rangle_D| > \tau$ (the first condition) then for every correlational
query function $\phi_i \in G_\psi$, $\langle \psi, \phi_i \rangle_D$ is within $\tau$ of $\langle f, \phi_i \rangle_{D'}$. Therefore the answers provided by our simulator are
valid for the execution of $\mathcal{A}$ when the target function is $f$ and the distribution is $D'$. That is, they could have
been returned by STAT($f$, $D'$) with tolerance $\tau$. Therefore, by the definition of $\mathcal{A}$, the hypothesis $h_\psi$ satisfies
$\Pr_{D'}[f(x) \neq h_\psi(x)] \le \epsilon$ (the second condition). □

3.1 Conjunctions are not CSQ Learnable Distribution-Independently

We now demonstrate that for a carefully constructed set of distributions and conjunctions, no polynomial-size
approximating set satisfying the conditions of Definition 3.1 exists. Let $U$ be the uniform distribution over
$X = \{0, 1\}^n$. For a set $S \subseteq [n]$ we denote by $t_S(x)$ a conjunction of the variables with indices in $S$ and by $\chi_S(x)$
the parity function of the variables with indices in $S$. A well-known fact about the Fourier representation of
conjunctions (e.g. [12]) is that

$$t_S(x) = -1 + 2^{-k+1} \sum_{I \subseteq S} \chi_I(x).$$

To obtain the desired lower bound we note that any pair $(D, g)$ where $g$ is a real-valued function over $\{0, 1\}^n$
and $D$ is a distribution can be viewed as a real-valued function $g'(x) = g(x) D(x) / U(x) = 2^n \cdot g(x) D(x)$.
Here and below $D(x)$ refers to the probability density function of $D$. By definition, for every $x$, $g(x) D(x) =
g'(x) U(x)$ and therefore for any real-valued function $h$, $\langle h, g \rangle_D = \langle h, g' \rangle_U$. This simple transformation allows
us to view distribution-function pairs as functions over the uniform distribution and vice versa.
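The identity $\langle h, g \rangle_D = \langle h, g' \rangle_U$ is easy to confirm numerically. The sketch below is only an illustration: the dimension, the non-uniform distribution `D`, and the functions `g` and `h` are arbitrary hypothetical choices.

```python
from itertools import product

n = 3
points = list(product((0, 1), repeat=n))

# Uniform distribution U and a hypothetical non-uniform distribution D
# whose density grows with the number of ones in x.
U = {x: 1.0 / len(points) for x in points}
raw = {x: 1.0 + sum(x) for x in points}
total = sum(raw.values())
D = {x: raw[x] / total for x in points}

def inner(a, b, dist):
    """<a, b>_dist = E_{x ~ dist}[a(x) * b(x)]."""
    return sum(p * a(x) * b(x) for x, p in dist.items())

g = lambda x: 1.0 if x[0] == 1 and x[1] == 1 else -1.0   # some {-1,1}-valued function
h = lambda x: 0.5 * (-1.0) ** x[2]                       # some query function
g_prime = lambda x: g(x) * D[x] / U[x]                   # the pair (D, g) viewed over U

print(inner(h, g, D), inner(h, g_prime, U))
```

The two printed values agree up to rounding, and `g_prime(x)` equals $2^n \cdot g(x) D(x)$ as stated in the text.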

The basis of our constructions are functions whose Fourier transform equals the Fourier transform of
$t_S(x)$ but with all the Fourier coefficients for non-empty sets of size at most $k/3$ removed. We claim that these
functions can be seen as conjunctions over a close-to-uniform distribution.

Lemma 3.3 Let $k > 6$ be an integer divisible by 3 and let $S \subseteq [n]$ be any set of size $k$. There exists a function
$\theta_S(x)$ and distribution $D_S$ such that for every point $x$, $D_S(x) t_S(x) = U(x) \theta_S(x)$ and in addition

1. $\theta_S(x) = \alpha \left( -1 + 2^{-k+1} + 2^{-k+1} \sum_{I \subseteq S, |I| > k/3} \chi_I(x) \right)$ for a constant $\alpha \in [2/3, 2]$.

2. for every $x$, $D_S(x) / U(x) \in [1/3, 3]$.

Proof: Let $\phi_S(x) = -1 + 2^{-k+1} + 2^{-k+1} \sum_{I \subseteq S, |I| > k/3} \chi_I(x)$, in other words $t_S(x)$ with all the parities for subsets of size
$i \in [k/3]$ removed. Note that the total number of parities that were removed from $t_S(x)$ is $\sum_{i=0}^{k/3} \binom{k}{i} - 1 \le 2^{k-2}$.
Therefore for every $x$,

$$|t_S(x) - \phi_S(x)| \le 2^{-k+1} \left| \sum_{I \subseteq S, |I| \in [k/3]} \chi_I(x) \right| \le 2^{-k+1} \cdot 2^{k-2} = 1/2.$$

This implies that for every $x$, $\mathrm{sign}(\phi_S(x)) = \mathrm{sign}(t_S(x))$ and $L_1(\phi_S) = \mathbf{E}_U[|\phi_S(x)|] \in [1/2, 3/2]$. Now
let $D_S(x) = U(x) \cdot |\phi_S(x)| / L_1(\phi_S)$ and $\theta_S(x) = \phi_S(x) / L_1(\phi_S)$. This definition implies that $\sum_{x \in X} D_S(x) =
\mathbf{E}_U[|\phi_S(x)|] / L_1(\phi_S) = 1$. Hence $D_S(x)$ is a valid probability density function over $\{0, 1\}^n$. Further, $D_S(x) t_S(x) =
U(x) \theta_S(x)$. In other words the conjunction $t_S$ over the distribution $D_S$ can be viewed as the function $\theta_S(x)$
over the uniform distribution. Finally, note that $\alpha = 1 / L_1(\phi_S) \in [2/3, 2]$ and $D_S(x) / U(x) = |\phi_S(x)| / L_1(\phi_S) \in
[1/3, 3]$. □
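The construction in Lemma 3.3 can be checked numerically for a concrete size. The sketch below is illustrative only and rests on several assumptions: it takes $k = 12$, uses the convention $\chi_I(x) = (-1)^{\sum_{i \in I} x_i}$, and exploits the fact that every quantity depends on $x$ only through the number $w$ of ones among the coordinates in $S$, so it enumerates weight classes instead of all points. The helper names are hypothetical.

```python
from math import comb

def parity_sum(k, w, j):
    """Sum of chi_I(x) over all I inside S with |I| = j, where |S| = k and
    x has w ones on S; a counts the ones of x that fall inside I."""
    return sum((-1) ** a * comb(w, a) * comb(k - w, j - a) for a in range(j + 1))

def lemma_quantities(k):
    scale = 2.0 ** (-k + 1)
    t, phi = {}, {}
    for w in range(k + 1):
        all_parities = sum(parity_sum(k, w, j) for j in range(k + 1))
        high_parities = sum(parity_sum(k, w, j) for j in range(k // 3 + 1, k + 1))
        t[w] = -1.0 + scale * all_parities              # the conjunction t_S
        phi[w] = -1.0 + scale + scale * high_parities   # low-degree parities removed
    mass = {w: comb(k, w) / 2.0 ** k for w in range(k + 1)}  # uniform mass of each class
    l1 = sum(mass[w] * abs(phi[w]) for w in range(k + 1))    # L_1(phi_S)
    alpha = 1.0 / l1
    ratio = {w: abs(phi[w]) / l1 for w in range(k + 1)}      # D_S(x) / U(x)
    return t, phi, mass, alpha, ratio

k = 12
t, phi, mass, alpha, ratio = lemma_quantities(k)
print(alpha, min(ratio.values()), max(ratio.values()))
```

For this value of $k$ the sketch confirms that $\phi_S$ keeps the sign of $t_S$, that $D_S$ sums to one, and that both $\alpha$ and the density ratio stay inside the intervals claimed in the lemma.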

We now establish that the number of monotone conjunctions of $k$ variables such that any two conjunctions
share at most $k/3$ variables is large.

Lemma 3.4 For any integer $k \in [6..n/2]$ divisible by 3, there exists a set $\mathcal{S}_k \subseteq 2^{[n]}$ such that

• for every $S \in \mathcal{S}_k$, $|S| = k$;

• for every distinct $S, T \in \mathcal{S}_k$, $|S \cap T| \le k/3$;

• $|\mathcal{S}_k| \ge (n/(8k))^{k/3} + 1$.

Proof: There are $\binom{n}{k}$ different size-$k$ subsets of $[n]$ and each subset of size $k$ shares more than $k/3$ elements
with at most $\binom{k}{k/3} \binom{n - k/3}{2k/3}$ other subsets of size $k$. Hence by greedily constructing $\mathcal{S}_k$ we will obtain at least

$$\frac{\binom{n}{k}}{\binom{k}{k/3} \binom{n - k/3}{2k/3}} = \frac{n! \cdot ((2k/3)!)^2 \cdot (k/3)!}{(n - k/3)! \cdot (k!)^2} \ge \left( \frac{n}{8k} \right)^{k/3}$$

subsets. □

We are now ready to show that conjunctions of superconstant size are not CSQ learnable to subconstant
accuracy. Let $C_k$ denote the concept class of conjunctions of size at most $k$.

Theorem 3.5 If $C_k$ is CSQ learnable to accuracy $\epsilon \le 2^{-k}/6$ by a deterministic algorithm that uses $q$ queries
of tolerance $\tau$ then $q / \tau^2 \ge (n/(8k))^{k/3} / 16$.



Proof: We apply Theorem 3.2 to the assumed CSQ algorithm for $C_k$ and obtain that $\mathrm{CSQD}(C_k, \epsilon, \tau) \le q$.
Let $\psi(x) = \alpha (-1 + 2^{-k+1})$ and $D$ be the uniform distribution. By Definition 3.1 there exists a set $G$ of $q$
functions and a Boolean function $h$ such that for every $f \in C_k$ and distribution $D'$ at least one of the following
conditions holds:

1. $\Pr_{D'}[f(x) \neq h(x)] \le \epsilon$ or

2. there exists $g \in G$ such that $|\langle f, g \rangle_{D'} - \langle \psi, g \rangle_U| > \tau$.

Let $\mathcal{S}_k$ be the set given by Lemma 3.4 and $S \in \mathcal{S}_k$. We apply these conditions to $f = t_S$ and distribution
$D_S$ defined in Lemma 3.3 to obtain that $\Pr_{D_S}[t_S(x) \neq h(x)] \le \epsilon$ or there exists $g \in G$ such that $|\langle t_S, g \rangle_{D_S} -
\langle \psi, g \rangle_U| > \tau$. We first consider the implications of the first condition. By our assumption $\epsilon \le 2^{-k}/6$. For any
two distinct subsets $S, T \in \mathcal{S}_k$, $\Pr_U[t_S \neq t_T] > 2^{-k}$. This implies that if $\Pr_{D_S}[t_S \neq h] \le \epsilon$ then

$$\Pr_U[t_S \neq h] = \sum_{x : t_S(x) \neq h(x)} U(x) \overset{(*)}{\le} \sum_{x : t_S(x) \neq h(x)} 3 \cdot D_S(x) = 3 \cdot \Pr_{D_S}[t_S \neq h] \le 3\epsilon \le 2^{-k}/2, \quad (1)$$

where (*) is implied by property 2 in Lemma 3.3. Further, $\Pr_U[t_T \neq h] \ge \Pr_U[t_S \neq t_T] - \Pr_U[t_S \neq h] > 2^{-k}/2$
and hence, by the same argument as in equation (1),

$$\Pr_{D_T}[t_T \neq h] \ge \Pr_U[t_T \neq h]/3 > (2^{-k}/2)/3 = 2^{-k}/6 \ge \epsilon.$$

In other words, $h$ can be $\epsilon$-close to at most one conjunction $t_S$ for $S \in \mathcal{S}_k$.

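Now consider a subset $S$ for which the second condition holds. By the definition of $D_S$,
$\langle t_S, g\rangle_{D_S} = \langle \theta_S, g\rangle_U$ and therefore the second condition is equivalent to

$$|\langle \theta_S - \psi, g\rangle_U| \ge \tau.$$

We observe that $\theta_S - \psi = \alpha 2^{-k+1} \cdot \sum_{I \subseteq S,\, |I| > k/3} \chi_I$ and hence

$$\tau \le \left|\left\langle \alpha 2^{-k+1} \sum_{I \subseteq S,\, |I| > k/3} \chi_I,\ g\right\rangle_U\right| \le \alpha 2^{-k+1} \cdot \sum_{I \subseteq S,\, |I| > k/3} |\langle \chi_I, g\rangle_U|. \tag{2}$$

Equation (2) implies that there exists $I_S \subseteq S$ such that $|I_S| > k/3$ and

$$|\langle \chi_{I_S}, g\rangle_U| \ge \tau \cdot 2^{k-1}/(\alpha 2^k) = \tau/(2 \cdot \alpha) \ge \tau/4.$$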
We now need two crucial observations: 

1. For distinct $S, T \in \mathcal{S}_k$, $I_S \ne I_T$. This is true since $I_S$ is a subset of size at least $k/3 + 1$ of $S$
and $S$ shares at most $k/3$ elements with $T$ (of which $I_T$ is a subset).

2. For any function $g \in \mathcal{F}_1^\infty$, there exist at most $16/\tau^2$ sets $I$ such that $|\langle \chi_I, g\rangle_U| \ge \tau/4$.
This is true since $\langle \chi_I, g\rangle_U$ is simply the Fourier coefficient of $g$ with index $I$, denoted by $\hat{g}(I)$.
Parseval's identity states that $\sum_{I \subseteq [n]} \hat{g}(I)^2 = \|g\|_U^2 \le 1$ and therefore no more than $16/\tau^2$
Fourier coefficients of $g$ can have absolute value at least $\tau/4$.

Combining these two observations gives that the number of sets in $\mathcal{S}_k$ for which the second condition
holds is at most $16 \cdot q/\tau^2$. By combining this with the fact that the first condition can hold for at most one set
in $\mathcal{S}_k$ we obtain that $16 \cdot q/\tau^2 \ge |\mathcal{S}_k| - 1 \ge \left(\frac{n}{8k}\right)^{k/3}$. ■
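
The counting step in the second observation can be checked numerically. The following is a minimal sketch, assuming a randomly drawn bounded function `g` on a small cube; the helper name `fourier_coefficients` and the parameter values are chosen here for illustration and do not appear in the proof.

```python
import itertools
import random

def fourier_coefficients(g, n):
    """Return {I: E_U[g(x) * chi_I(x)]} over all subsets I of range(n), where chi_I(x) = prod_{i in I} x_i."""
    points = list(itertools.product((-1, 1), repeat=n))
    coeffs = {}
    for r in range(n + 1):
        for subset in itertools.combinations(range(n), r):
            total = 0.0
            for x in points:
                chi = 1
                for i in subset:
                    chi *= x[i]
                total += g[x] * chi
            coeffs[subset] = total / len(points)
    return coeffs

random.seed(0)
n, tau = 6, 1.0
# A random function from {-1,1}^n to [-1,1], stored as a lookup table.
g = {x: random.uniform(-1, 1) for x in itertools.product((-1, 1), repeat=n)}
coeffs = fourier_coefficients(g, n)

parseval_sum = sum(c * c for c in coeffs.values())       # sum of squared Fourier coefficients
norm_sq = sum(v * v for v in g.values()) / 2 ** n        # ||g||_U^2
heavy = [s for s, c in coeffs.items() if abs(c) >= tau / 4]
print(parseval_sum, norm_sq, len(heavy), 16 / tau ** 2)
```

Since each heavy coefficient contributes at least $\tau^2/16$ to a sum that is at most 1, the list `heavy` can never contain more than $16/\tau^2$ entries.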

Remark 3.6 Theorem 3.5 also applies to CSQ learning by randomized algorithms since a randomized algorithm
for the set of conjunction-distribution pairs we consider can be converted to a non-uniform deterministic
algorithm via a standard transformation (e.g. [1 ]).

Corollary 3.7 For any $k = \omega(1)$ and $\epsilon = o(1)$, $C_k$ is not evolvable (distribution-independently) to accuracy
$\epsilon$.

Interestingly, conjunctions are known to be weakly CSQ learnable (distribution-independently). Therefore
Corollary 3.7 also implies that traditional boosting algorithms [24, 10] cannot be adapted to CSQ learning
(and hence evolvability).

4 Evolvability of Halfspaces with Non-Linear Loss Functions



In this section we demonstrate that halfspaces are evolvable distribution-independently for a wide class of
loss functions using a polynomial in $n$, $1/\epsilon$ and $1/\gamma$ amount of resources. Here $\gamma$ is the margin of the target
halfspace on the domain $X$ of the learning problem. For example, if we set $X = \{-1/\sqrt{n}, 1/\sqrt{n}\}^n$ (or the
Boolean hypercube scaled to fit in $B_n$) then all functions that can be represented by a halfspace with integer
weights upper-bounded in absolute value by $m$ will have margin of at least $1/(nm)$. Consequently, our
result implies that such functions are evolvable distribution-independently over the Boolean hypercube for
any $m$ upper-bounded by a polynomial in $n$. This class of functions includes conjunctions, disjunctions,
decision lists of length $O(\log n)$ and majority functions. The mutation algorithm we use is very simple and
natural for evolving halfspaces. The only operations it requires are adding $\alpha \cdot x_i$ to the current function for a
real $\alpha$ and bounding the value of the function to be in $[-1, 1]$.
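
The margin claim for the scaled hypercube can be checked by brute force for small $n$. The sketch below rests on two assumptions that are not spelled out in this passage: the margin is taken to be $\min_x |w \cdot x - \theta| / \|w\|_2$, and the threshold `t` is an integer chosen so that $w \cdot s - t$ is never zero on $\{-1,1\}^n$. The function name and the three example halfspaces are hypothetical.

```python
import itertools
import math

def margin_on_scaled_cube(w, t):
    """Margin of sign(w.s - t) on {-1/sqrt(n), 1/sqrt(n)}^n after scaling w to unit Euclidean norm.

    Assumption: margin = min_x |w.x - theta| / ||w||_2 with x = s / sqrt(n) and theta = t / sqrt(n).
    """
    n = len(w)
    norm = math.sqrt(sum(wi * wi for wi in w))
    gaps = (abs(sum(wi * si for wi, si in zip(w, s)) - t) / math.sqrt(n)
            for s in itertools.product((-1, 1), repeat=n))
    return min(gaps) / norm

n = 7
majority = margin_on_scaled_cube([1] * n, 0)                    # n is odd, so w.s is never 0
conjunction = margin_on_scaled_cube([1, 1, 1, 0, 0, 0, 0], 2)   # x1 AND x2 AND x3, with +1 meaning true
weighted = margin_on_scaled_cube([3, 2, 2, 1, 1, 1, 1], 0)      # integer weights bounded by m = 3
print(majority, conjunction, weighted)
```

Under these assumptions the majority function attains the bound $1/(nm)$ with $m = 1$ exactly, while the other two examples have a larger margin than the bound requires.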

A more general way to describe this result is to take the domain to be $B_n$ and define the margin relative
to the support of the target distribution. Specifically, let $\mathrm{HS}_\gamma$ denote the set of distribution-function pairs over
$B_n$ such that $(D, f) \in \mathrm{HS}_\gamma$ if (and only if) $f$ can be represented by a halfspace with margin $\gamma$ on the support
of distribution $D$ (for brevity we use "margin on $D$" to refer to the margin on the support of $D$). For $X \subseteq B_n$,
we denote by $\mathrm{HS}_\gamma(X)$ the set of all functions that can be represented by a halfspace with margin $\gamma$ on $X$.

Our proof of evolvability relies on a lemma which proves that for every current hypothesis $\phi \in \mathcal{F}_1^\infty$,
there exists an efficiently computable and small neighborhood $N(\phi)$ of $\phi$ such that for every target halfspace $f$
with margin $\gamma$ on distribution $D$, if the fitness of $\phi$ is not $\epsilon$-close to the optimum then there exists $\phi' \in N(\phi)$ whose
fitness is observably higher than the fitness of $\phi$. Following Kanade et al. we refer to such a function $N$
as a strictly beneficial neighborhood function. A strictly beneficial neighborhood function immediately implies
monotone evolvability from any starting function [6]. To see this observe that for a mutation algorithm that
produces a random member of the strictly beneficial neighborhood, every step of the evolution algorithm will
increase performance by an inverse-polynomial amount until it reaches $1 - \epsilon$. Further, as was observed by
Kanade et al. [14], it also implies evolvability when the target function is allowed to change gradually, or
drift.

We first show the existence of a strictly beneficial neighborhood function for halfspaces with the quadratic
loss function and then examine the conditions on the loss function that allow a similar argument to go through.
For $a \in \mathbb{R}$, define

$$P_1(a) = \begin{cases} a & |a| \le 1 \\ \mathrm{sign}(a) & \text{otherwise.} \end{cases}$$

Theorem 4.1 For $\phi(x) \in \mathcal{F}_1^\infty$, let

$$N_\alpha(\phi) = \{P_1(\phi + \alpha' \cdot x_i) \mid i \in [0..n],\ |\alpha'| = \alpha\} \cup \{0\}.$$

For every halfspace $f$ with margin $\gamma$ on distribution $D$ and every $\epsilon > 0$, there exists $\phi' \in N_\alpha(\phi)$ for which

$$\|f - \phi'\|_D^2 \le \max\{\|f - \phi\|_D^2 - \alpha^2,\ \epsilon\},$$

where $\alpha = \frac{\epsilon\gamma}{3\sqrt{n}}$.

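Proof: Let $f = \mathrm{sign}(\sum_{i \in [n]} w_i x_i - \theta)$ be the representation of $f$ that has margin $\gamma$ on $D$. The claim holds if
$\|f - \phi\|_D^2 \le \epsilon$. We can therefore assume that $\|f - \phi\|_D^2 > \epsilon$. In particular, since for every $a \in [-2, 2]$, $|a| \ge a^2/2$
we obtain that $\mathbf{E}_D[|f - \phi|] \ge \epsilon/2$.

For every $x$ in the support of $D$, $f(x) - \phi(x)$ has the same sign as $f(x)$ and therefore also the same sign as
$\sum_{i \in [n]} w_i x_i - \theta$. Therefore,

$$\mathbf{E}_D\left[(f - \phi)\left(\sum_{i \in [n]} w_i x_i - \theta\right)\right] \ge \gamma\, \mathbf{E}_D[|f - \phi|] \ge \epsilon\gamma/2. \tag{3}$$

At the same time, using the Cauchy-Schwarz inequality we can obtain

$$\mathbf{E}_D\left[(f - \phi)\left(\sum_{i \in [n]} w_i x_i - \theta\right)\right] \le \sum_{i \in [n]} |w_i| \cdot |\mathbf{E}_D[(f - \phi)x_i]| + |\theta| \cdot |\mathbf{E}_D[(f - \phi)]| \le \sqrt{\theta^2 + \sum_{i \in [n]} w_i^2} \cdot \sqrt{\sum_{i \in [0..n]} \mathbf{E}_D[(f - \phi)x_i]^2} \le \sqrt{2} \sqrt{\sum_{i \in [0..n]} \mathbf{E}_D[(f - \phi)x_i]^2}. \tag{4}$$

By combining equations (3) and (4) we obtain that

$$\sum_{i \in [0..n]} \mathbf{E}_D[(f - \phi)x_i]^2 \ge (\epsilon\gamma)^2/8.$$

From here we can conclude that there exists $j \in [0..n]$ such that

$$|\mathbf{E}_D[(f - \phi)x_j]| \ge \epsilon\gamma/\sqrt{8(n+1)} \ge \epsilon\gamma/(3\sqrt{n}). \tag{5}$$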

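Now we claim that a step in the direction of $x_j$ from $\phi$ will decrease the distance (in $\|\cdot\|_D$ norm) to $f$.
Formally,

Lemma 4.2 Let $\alpha' = \alpha \cdot \mathrm{sign}(\mathbf{E}_D[(f - \phi)x_j])$, where $\alpha = \frac{\epsilon\gamma}{3\sqrt{n}}$ (as defined in the statement of the theorem).
Then

$$\|f - (\phi + \alpha' \cdot x_j)\|_D^2 \le \|f - \phi\|_D^2 - \alpha^2.$$

Proof:

$$\|f - (\phi + \alpha' \cdot x_j)\|_D^2 = \|f - \phi\|_D^2 + \alpha'^2 \|x_j\|_D^2 - 2\langle f - \phi, \alpha' \cdot x_j\rangle_D.$$

To obtain the claim it remains to observe that $\|x_j\|_D^2 \le 1$ and that

$$2\langle f - \phi, \alpha' \cdot x_j\rangle_D = 2\alpha'\, \mathbf{E}_D[(f - \phi)x_j] \ge 2\alpha'^2 = 2\alpha^2.$$

■ (Lem. 4.2)

Now let $\phi' = P_1(\phi + \alpha' \cdot x_j)$. If for a point $x$, $\phi'(x) = \phi(x) + \alpha' \cdot x_j$ then clearly
$f(x) - \phi'(x) = f(x) - (\phi(x) + \alpha' \cdot x_j)$. Otherwise, if $|\phi(x) + \alpha' \cdot x_j| > 1$ then $\phi'(x) = \mathrm{sign}(\phi(x) + \alpha' \cdot x_j)$
and for any value $f(x) \in \{-1, 1\}$, $|f(x) - \phi'(x)| \le |f(x) - (\phi(x) + \alpha' \cdot x_j)|$. This implies that

$$\|f - \phi'\|_D^2 \le \|f - (\phi + \alpha' \cdot x_j)\|_D^2 \le \|f - \phi\|_D^2 - \alpha^2.$$

By definition, $\phi' \in N_\alpha(\phi)$ and hence we obtain the claimed result. ■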
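
The monotone decrease guaranteed by Theorem 4.1 can be observed in a small simulation. This is a simplified sketch rather than the evolution algorithm itself: expectations are computed exactly over a finite support instead of being estimated from samples, the best neighbor is selected deterministically instead of drawing a random beneficial mutation, and the step size is the intermediate bound $\epsilon\gamma/\sqrt{8(n+1)}$ from (5) so that the guarantee also covers the small $n$ used here. The target (majority of 3 bits), the value `gamma = 1/3` for its unit-norm representation, and all names are choices made for this example.

```python
import itertools
import math

n, eps = 3, 0.1
scale = 1 / math.sqrt(n)
# Support of D: the scaled hypercube, with the constant coordinate x_0 = 1 prepended.
points = [(1.0,) + tuple(s * scale for s in signs)
          for signs in itertools.product((-1, 1), repeat=n)]
target = [1.0 if sum(x[1:]) > 0 else -1.0 for x in points]   # majority of 3 bits
gamma = 1 / 3                                   # margin of w = (1,1,1)/sqrt(3), theta = 0 on this support
alpha = eps * gamma / math.sqrt(8 * (n + 1))    # intermediate bound from equation (5)

def clip(a):
    """P_1: the identity on [-1, 1] and sign(a) outside it."""
    return max(-1.0, min(1.0, a))

def sq_loss(phi):
    """||f - phi||_D^2 for D uniform over `points`; phi is stored as a list of values."""
    return sum((f - p) ** 2 for f, p in zip(target, phi)) / len(points)

def neighborhood(phi):
    """N_alpha(phi): clipped steps of size alpha along every x_i, plus the zero function."""
    result = [[0.0] * len(points)]
    for i in range(n + 1):
        for step in (alpha, -alpha):
            result.append([clip(p + step * x[i]) for p, x in zip(phi, points)])
    return result

phi = [0.0] * len(points)
losses = [sq_loss(phi)]
while losses[-1] > eps:
    phi = min(neighborhood(phi), key=sq_loss)   # simplification: take the best neighbor
    losses.append(sq_loss(phi))
print(len(losses) - 1, losses[-1])
```

Every recorded loss is at least $\alpha^2$ smaller than the previous one, so the loop stops after at most $1/\alpha^2$ iterations.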

We now demonstrate that a similar result can be obtained under several mild conditions on the loss
function. In essence, we require that the loss function can be well approximated by a linear function with a
slope that is not too close to 0. Formally,

Definition 4.3 For positive constants $a$, $A$ and $B$ we say that a loss function $L : \{-1, 1\} \times [-1, 1] \to \mathbb{R}^+$ is
well-behaved with bounds $a, A, B$ if

1. $L(-1,-1) = L(1,1) = 0$;

2. $L(1,-1) = L(-1,1) = 2$;

3. for $\ell \in \{-1,1\}$, $L(\ell, z)$ is twice differentiable in $[-1,1]$ (the differentiation is always in the second
variable);

4. for $\ell \in \{-1,1\}$, $L'(\ell, \ell) = 0$ and $-\ell \cdot L'(\ell, \ell(1-z)) \ge A \cdot L(\ell, \ell(1-z))^a$;

5. for $\ell \in \{-1,1\}$, for every $z \in [-1,1]$, $L''(\ell, z) \le B$.
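
The conditions of Definition 4.3 can be checked numerically for a candidate loss. The sketch below does this for the power loss $L(y,z) = |y-z|^c/2^{c-1}$ with $c \ge 2$. The constants $a = (c-1)/c$, $A = c \cdot 2^{-(c-1)/c}$ and $B = c(c-1)/2$ are obtained here by direct calculation and are an assumption of this example, not values stated in the text; condition (4) is checked on a grid over the second argument in $[-1,1]$, and the function names are hypothetical.

```python
def power_loss(y, z, c):
    """L(y, z) = |y - z|^c / 2^(c-1)."""
    return abs(y - z) ** c / 2 ** (c - 1)

def power_loss_d1(y, z, c):
    """First derivative of power_loss with respect to z."""
    sign = (z > y) - (z < y)
    return sign * c * abs(y - z) ** (c - 1) / 2 ** (c - 1)

def power_loss_d2(y, z, c):
    """Second derivative of power_loss with respect to z (for c >= 2)."""
    return c * (c - 1) * abs(y - z) ** (c - 2) / 2 ** (c - 1)

def check_well_behaved(c, grid=201, tol=1e-9):
    """Check conditions 1, 2, 4 and 5 on a grid, with a, A, B derived for the power loss."""
    a, A, B = (c - 1) / c, c * 2 ** (-(c - 1) / c), c * (c - 1) / 2
    ok = power_loss(1, 1, c) == 0 and power_loss(-1, -1, c) == 0
    ok = ok and abs(power_loss(1, -1, c) - 2) < tol and abs(power_loss(-1, 1, c) - 2) < tol
    for label in (-1, 1):
        ok = ok and power_loss_d1(label, label, c) == 0
        for i in range(grid):
            z = -1 + 2 * i / (grid - 1)
            slope = -label * power_loss_d1(label, z, c)
            # Condition 4: the slope is at least A * L^a (up to floating-point tolerance).
            ok = ok and slope >= A * power_loss(label, z, c) ** a - tol
            # Condition 5: the second derivative is at most B.
            ok = ok and power_loss_d2(label, z, c) <= B + tol
    return ok

print(check_well_behaved(2), check_well_behaved(3))
```

For this family condition (4) holds with equality for the derived constants, which is why a small tolerance is used in the comparison.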

We remark that condition (2) is for convenience only and can be achieved by scaling any loss function
satisfying the other conditions. Condition (4) ensures that the loss function is monotone (that is for all
$y, y' \in [-1, 1]$, if $y \le y'$ then $L(-1, y) \le L(-1, y')$ and $L(1, y') \le L(1, y)$) and that it has a non-negligible
slope whenever the loss itself is non-negligible. Condition (5) ensures that the linear approximation to $L$
dominates the remainder term in the Taylor series. A simple example of a well-behaved loss function is
$L(y, z) = |y - z|^c/2^{c-1}$ for any constant $c \ge 2$. It is also easy to see that any convex combination of
well-behaved loss functions is well-behaved. We now prove a generalization of Theorem 4.1 to well-behaved loss
functions.

Theorem 4.4 Let $L$ be a well-behaved loss function with bounds $a$, $A$ and $B$. For $\phi(x) \in \mathcal{F}_1^\infty$, let

$$N_\alpha(\phi) = \{P_1(\phi + \alpha' \cdot x_i) \mid i \in [0..n],\ |\alpha'| = \alpha\} \cup \{0\}.$$

For every halfspace $f$ with margin $\gamma$ on distribution $D$ and every $\epsilon > 0$, there exists $\phi' \in N_\alpha(\phi)$ for which

$$\mathbf{E}_D[L(f, \phi')] \le \max\{\mathbf{E}_D[L(f, \phi)] - \alpha^2 \cdot B/2,\ \epsilon\},$$

where $\alpha = A \cdot \gamma \cdot \epsilon^{a+1}/(B \cdot 2^{a+3}\sqrt{n})$.

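Proof: As before, we can assume that $\mathbf{E}_D[L(f, \phi)] > \epsilon$. In particular, $\Pr_D[L(f, \phi) \ge \epsilon/2] \ge \epsilon/4$. Then, by
property (4) of well-behaved loss functions, $\Pr_D[|L'(f, \phi)| \ge A(\epsilon/2)^a] \ge \epsilon/4$. This implies that

$$\mathbf{E}_D[|L'(f, \phi)|] \ge \epsilon/4 \cdot A \cdot (\epsilon/2)^a = A \cdot \epsilon^{a+1}/2^{a+2}.$$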
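
The probability bound used in the first step follows from a short averaging argument. It is sketched here under the assumption that $0 \le L \le 2$ on $[-1,1]$, which follows from conditions (1), (2) and monotonicity. Writing $p = \Pr_D[L(f, \phi) \ge \epsilon/2]$:

```latex
\epsilon < \mathbf{E}_D[L(f,\phi)] \le 2 \cdot p + \frac{\epsilon}{2} \cdot (1 - p) \le 2p + \frac{\epsilon}{2}
\quad\Longrightarrow\quad p > \frac{\epsilon}{4}.
```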

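By monotonicity of $L$ (or property (4)), for every $x$ in the support of $D$, $-L'(f(x), \phi(x))$ has the same sign
as $f(x)$ and therefore also the same sign as $\sum_{i \in [n]} w_i x_i - \theta$. This gives

$$\mathbf{E}_D\left[-L'(f, \phi)\left(\sum_{i \in [n]} w_i x_i - \theta\right)\right] \ge \gamma\, \mathbf{E}_D[|L'(f, \phi)|] \ge \gamma \cdot A \cdot \epsilon^{a+1}/2^{a+2}. \tag{6}$$

In addition, as in equation (4), we have

$$\mathbf{E}_D\left[-L'(f, \phi)\left(\sum_{i \in [n]} w_i x_i - \theta\right)\right] \le \sqrt{2} \sqrt{\sum_{i \in [0..n]} \mathbf{E}_D[L'(f, \phi)x_i]^2}. \tag{7}$$

By combining equations (6) and (7) we obtain that

$$\sum_{i \in [0..n]} \mathbf{E}_D[L'(f, \phi)x_i]^2 \ge A^2 \cdot \gamma^2 \epsilon^{2a+2}/2^{2a+5}.$$

From here we can conclude that there exists $j \in [0..n]$ such that

$$|\mathbf{E}_D[L'(f, \phi)x_j]| \ge A \cdot \gamma \cdot \epsilon^{a+1}/(2^{a+2}\sqrt{2n+2}) \ge A \cdot \gamma \cdot \epsilon^{a+1}/(2^{a+3}\sqrt{n}). \tag{8}$$

We denote the right side of this inequality by $\rho$.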

To finish the proof we prove an analogue of Lemma 4.2 saying that a step in the direction of $x_j$ from $\phi$
will decrease the loss. Formally,

Lemma 4.5 Let $\alpha' = -\alpha \cdot \mathrm{sign}(\mathbf{E}_D[L'(f, \phi)x_j])$, and $\phi' = P_1(\phi + \alpha' \cdot x_j)$, where $\alpha = \rho/B$ (as defined in the
statement of the theorem). Then

$$\mathbf{E}_D[L(f, \phi')] \le \mathbf{E}_D[L(f, \phi)] - \alpha^2 \cdot B/2.$$

Proof: Let $x \in X$ be any point. Assume that $f(x) = -1$. For convenience we extend the loss function $L(-1, z)$
to values $z \in [-2, -1)$ by setting $L(-1, z) = L(-1, -2 - z)$ (that is by making the loss symmetric around
$-1$). By the properties of the loss function, $L(-1, -1) = 0$, $L'(-1, -1) = 0$ and for $z \in [-2, -1)$,
$L''(-1, z) = L''(-1, -2 - z)$. This implies that the extended $L$ is twice differentiable in $[-2, 1]$ and $L''(-1, z) \le B$ for every
$z \in [-2, 1]$. We first assume that $\phi(x) + \alpha' \cdot x_j \in [-2, 1]$. $L$ is twice differentiable and therefore Taylor's theorem
gives

$$L(-1, \phi(x) + \alpha' \cdot x_j) - L(-1, \phi(x)) = \alpha' \cdot x_j \cdot L'(-1, \phi(x)) + (\alpha' \cdot x_j)^2 \cdot L''(-1, \zeta)/2,$$

where $\zeta \in [\phi(x), \phi(x) + \alpha' \cdot x_j] \subseteq [-2, 1]$. Also note that in this case, $L(-1, \phi(x) + \alpha' \cdot x_j) \ge L(-1, \phi'(x))$. This
means that

$$L(-1, \phi'(x)) - L(-1, \phi(x)) \le \alpha' \cdot x_j \cdot L'(-1, \phi(x)) + \alpha^2 \cdot B/2. \tag{9}$$

Now if $\phi(x) + \alpha' \cdot x_j > 1$ then $\phi'(x) = 1$ and $\alpha' \cdot x_j > 1 - \phi(x) \ge 0$. Then
$\alpha' \cdot x_j \cdot L'(-1, \phi(x)) \ge (1 - \phi(x)) \cdot L'(-1, \phi(x))$ (as $L'(-1, \phi(x)) \ge 0$). Hence,

$$L(-1, \phi'(x)) - L(-1, \phi(x)) = (1 - \phi(x)) \cdot L'(-1, \phi(x)) + (1 - \phi(x))^2 \cdot L''(-1, \zeta)/2 \le \alpha' \cdot x_j \cdot L'(-1, \phi(x)) + \alpha^2 \cdot B/2, \tag{10}$$

where $\zeta \in [\phi(x), 1]$. By treating the case when $f(x) = 1$ symmetrically and combining equations (9) and (10)
we will obtain that for every $x$,

$$L(f(x), \phi'(x)) - L(f(x), \phi(x)) \le \alpha' \cdot x_j \cdot L'(f(x), \phi(x)) + \alpha^2 \cdot B/2.$$

This immediately implies that

$$\mathbf{E}_D[L(f(x), \phi'(x))] - \mathbf{E}_D[L(f(x), \phi(x))] \le \alpha' \cdot \mathbf{E}_D[x_j \cdot L'(f(x), \phi(x))] + \alpha^2 \cdot B/2 \le -\alpha\rho + \alpha^2 \cdot B/2 = -\alpha^2 \cdot B/2.$$

■ (Lem. 4.5)

To finish the proof we observe that $\phi' \in N_\alpha(\phi)$. ■
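
The choice $\alpha = \rho/B$ in Lemma 4.5 is the step size that minimizes the upper bound $-\alpha\rho + \alpha^2 \cdot B/2$ reached at the end of its proof. The short calculation below is an added illustration rather than part of the original argument; the closing count of steps assumes that every generation realizes the guaranteed improvement and uses $L \le 2$.

```latex
h(\alpha) = -\alpha\rho + \frac{\alpha^2 B}{2}, \qquad
h'(\alpha) = -\rho + \alpha B = 0 \iff \alpha = \frac{\rho}{B}, \qquad
h\!\left(\frac{\rho}{B}\right) = -\frac{\rho^2}{2B} = -\frac{\alpha^2 B}{2}.
% Since E_D[L(f,\phi)] \le 2, at most 2 / (\rho^2 / (2B)) = 4B / \rho^2 such steps
% can occur before the expected loss drops to \epsilon.
```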

As we have mentioned, a simple corollary of Theorem 4.4 is distribution-independent evolvability of large
margin halfspaces with any well-behaved loss function.

Theorem 4.6 For every well-behaved loss function $L$ and $\gamma \ge 1/q(n)$ for some polynomial $q(\cdot)$, $\mathrm{HS}_\gamma$ over $B_n$
is monotonically evolvable with $L$.

We make two remarks regarding these theorems. 



Remark 4.7 In both Theorems 4.1 and 4.4 it is not necessary to know the exact value of $\alpha$ to create a
strictly beneficial neighborhood. It is easy to see from the analysis that the bound holds for every
$\alpha_0 \le \max_{j \in [0..n]}\{|\mathbf{E}_D[L'(f, \phi)x_j]|\}$. Therefore by including in the neighborhood steps for all values of $\alpha = 2^{-t}$ for
$t \in [n]$, the neighborhood will include a function with at least 1/4 of the improvement that can be achieved
when a bound on $\alpha$ is known in advance.
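
The factor 1/4 in the remark comes from the fact that a geometric grid of step sizes always contains a value within a factor of 2 below any target step, and the guaranteed improvement scales with the square of the step. A minimal sketch, assuming the target $\alpha$ is at least $2^{-n}$ and reading $[n]$ as $\{1, \ldots, n\}$; the function name is hypothetical.

```python
def largest_grid_step(alpha, n):
    """Largest 2**-t with t in 1..n that does not exceed alpha, or None if alpha < 2**-n."""
    for t in range(1, n + 1):
        if 2.0 ** -t <= alpha:
            return 2.0 ** -t
    return None

for alpha in (0.3, 0.01, 0.0049):
    step = largest_grid_step(alpha, 10)
    # step lies in (alpha/2, alpha], so step**2 is more than a quarter of alpha**2.
    print(alpha, step, (step / alpha) ** 2)
```

The neighborhood grows by a factor of $n$ in this variant, which keeps it polynomial in size.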



Remark 4.8 Theorem 4.4 does not require the loss function to be the same for all $x$ as long as for every point
$x$, the loss function is well-behaved with the same bounds $a, A, B$. Similarly the loss function does not need
to stay the same between generations and can change arbitrarily as long as it is well-behaved with the same
bounds $a, A, B$.

A number of popular machine learning algorithms work by embedding the data points in a different
Euclidean space (most commonly by using a kernel) and then applying a learning algorithm for halfspaces, such
as SVM. This method is also used in a number of theoretical algorithms such as the DNF learning
algorithm based on the polynomial threshold function representation of Klivans and Servedio [18]. As expected,
this technique can be easily translated to the evolvability framework and then used together with our result.
Formally, let $C$ and $C'$ be concept classes over the domains $X$ and $X'$, respectively. The concept class $C$ over $X$
is said to be embeddable as $C'$ over $X'$ if there exists a function $\Phi : X \to X'$ such that for every $f \in C$,
there exists $g \in C'$ such that for every $x \in X$, $g(\Phi(x)) = f(x)$. We also say that the embedding is efficient
if $\Phi(x)$ is computable efficiently, that is in time polynomial in the dimension of $x$ (or description length in
general). Embeddability of concept classes into large-margin halfspaces has been studied in a number of
works initiated by Forster [8] and Forster et al. [9] (see [2Q, ,2^, JL2J for some recent results). The inverse 
of the optimal margin is referred to as the margin complexity of a concept class lEoll . Besides its impor- 
tance to machine learning, it has several connections to fundamental quantities in communication complexity 



111. 25iil9|]. We cannot invoke this measure directly to upper-bound the complexity of using our evolution 
algorithm since margin complexity disregards the computational complexity of the embedding function. But 
given an efficient embedding function the application of our evolution algorithm becomes straightforward. 
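
As a concrete illustration of an embedding into large-margin halfspaces, the sketch below maps points of $\{-1,1\}^2$ to their scaled monomials of degree at most 2, under which the parity of two bits becomes a halfspace. The choice of embedding, the function names and the weight vector `w` are assumptions made for this example; they are not taken from the cited works.

```python
import itertools
import math

def monomial_embedding(x, degree):
    """Phi(x): all non-constant monomials of x up to `degree`, scaled so that Phi(x) lies in the unit ball."""
    feats = [math.prod(x[i] for i in idx)
             for r in range(1, degree + 1)
             for idx in itertools.combinations(range(len(x)), r)]
    scale = math.sqrt(len(feats))
    return [v / scale for v in feats]

def parity(x):
    """Target f over X = {-1,1}^2: the product of the two bits."""
    return x[0] * x[1]

def halfspace(y, w, theta=0.0):
    """g(y) = sign(w.y - theta), returned as +1 or -1."""
    return 1 if sum(wi * yi for wi, yi in zip(w, y)) - theta > 0 else -1

w = [0.0, 0.0, 1.0]   # unit-norm weights selecting the x_0 * x_1 monomial
cube = list(itertools.product((-1, 1), repeat=2))
embedded = {x: monomial_embedding(x, 2) for x in cube}
margin = min(abs(sum(wi * yi for wi, yi in zip(w, y))) for y in embedded.values())
print(margin)
```

Here $g(\Phi(x)) = f(x)$ on all four points and the margin is $1/\sqrt{3}$, so the embedded problem falls within the large-margin setting of this section.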

Corollary 4.9 Let $C$ be a concept class over domain $X$, $X' \subseteq B_n$ and $\gamma \ge 1/q(n)$ for some polynomial
$q(\cdot)$. If there exists an efficiently computable embedding of $C$ over $X$ to $\mathrm{HS}_\gamma(X')$ over $X'$, then $C$ is evolvable
monotonically with any well-behaved loss function.